The following report presents a quantitative text analysis of The Guardian’s headlines published between 18 October 2021 and 28 November 2021. It applies sentiment analysis to the coverage of the COP26 negotiations, which took place in Glasgow from 31 October to 12 November 2021. This descriptive analysis aims to identify possible changes in the opinions and/or political positioning of the newspaper.
The 26th Conference of the Parties, held in Glasgow, represented a crucial moment for climate policy negotiations.
Most of the world’s influential leaders attended the event to discuss future global action on climate mitigation and adaptation, together with non-state actors and internationally renowned personalities. The occasion gained substantial media attention from all over the world, with peaks right before the start of the COP (the so-called ‘PreCOP’ events), during the Conference itself, and right after its conclusion.
However, media outlets approach climate change in different ways that reflect their political positioning: the headlines, the highlights and the frequently mentioned topics differ accordingly.
We collect data from the headlines of a British newspaper to analyse possible trends and changes in sentiment over a timeframe that runs from the weeks right before the Conference to the period right after it.
As public policy students concerned about climate negotiations, we are interested in investigating the opinions and attitudes expressed by media outlets in the above-mentioned periods. We are mainly concerned with whether the standpoint and perspective of media outlets changed over time, and how this change developed.
The relevance of our analysis lies in our curiosity about how newspapers behave around critical international occasions such as COP26. Understanding whether they inform people while adding a particular sentiment (which could also reflect a general feeling among readers), or whether they prefer to remain neutral and report factual events objectively, could foster a deeper comprehension of the role of information and media in climate change developments.
To accomplish that, we decided to analyse the sentiment of a single media outlet published in the COP26 host country, the UK: The Guardian. This newspaper is considered left-leaning, according to YouGov findings. Topics such as climate financing were among the most critical ones on the table at the COP26 negotiations. Choosing a non-neutral outlet, which would have endorsed such topics, can therefore show more compelling results in terms of changes in political positioning with respect to the outcomes of the Conference, which may in turn be reflected in the headlines.
The principal objective of this research is to analyse trends in the attitude of the newspaper’s headlines towards COP26 topics. The further questions of this analysis fall into two macro areas. The first one tackles the original and main interest, that is whether the ratio of positive to negative words changes over time, and in which period this happens. Questions related to this area are:
The second area concerns a more specific analysis that also takes into account how such results change when different measurement instruments, in this case dictionaries, are used. Questions related to this area are:
The text analysis is performed mainly using two different packages: tidytext and quanteda. tidytext follows the tidyverse design philosophy, while quanteda is organised around its own corpus, tokens and document-feature matrix objects. The main practical difference between the two tools is that quanteda works with corpus objects, as is customary in NLP, while tidytext processes texts stored as character columns in a data frame. We employed both tools so that each research question could be addressed in the most appropriate way. Specifically, tidytext was useful for building analyses and visualizations involving dates, in a simpler manner than with the quanteda document-level variables. The quanteda package was instead particularly useful for the targeted sentiment analysis we conducted, and it made it possible to check the consistency of the results with another dictionary, LSD2015. The keywords-in-context function was also used as an exploratory tool.
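The difference between the two representations can be illustrated with a minimal sketch. The two headlines, the column names and the object names below are hypothetical and are not taken from the scraped data; the sketch only assumes that dplyr, tidytext and quanteda are installed.

```r
library(dplyr)
library(tidytext)
library(quanteda)

# Hypothetical example headlines, not taken from the scraped data set.
headlines <- tibble(
  doc_id = c("h1", "h2"),
  date   = as.Date(c("2021-10-31", "2021-11-01")),
  text   = c("Leaders gather as climate talks open",
             "Protesters demand faster action on emissions")
)

# tidytext: one row per word; the date column stays attached to every word.
tidy_tokens <- headlines %>%
  unnest_tokens(word, text)

# quanteda: a corpus object; the date becomes a document-level variable.
corp <- corpus(headlines, docid_field = "doc_id", text_field = "text")
toks <- tokens(corp, remove_punct = TRUE)

# Document-level variables are retrieved with docvars().
docvars(corp, "date")
```

In the tidy format, grouping by date is an ordinary dplyr operation, which is why date-based plots tend to be simpler to build there.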
As a newspaper of the country hosting the climate negotiations, The Guardian does not represent an ideal sample of headlines for deducing, through sentiment analysis, whether or not COP26 met expectations. Indeed, the results only show the changes in opinion for the specific political leaning that this outlet represents. However, the value of this project lies in applying sentiment analysis procedures to information scraped from the web and presenting the results to the user in an accessible format. It is therefore necessary to acknowledge the very limited scope of this analysis: its findings apply only to this specific and small sample.
A further limitation concerns the dates that were scraped from The Guardian website. Given the web-scraping strategy used, the most recent dates (December and the end of November 2021) present some missing values caused by a heterogeneous format in the website pages. For demonstration purposes we simply dropped those missing values, which further limits the scope of the analysis.
The web-scraping, cleaning and formatting section of the analysis can be found in the R script scraping_and_data_cleaning, which is available in the repository.
The web-scraping strategy consists of downloading the headlines from multiple pages of the newspaper website by date (static web scraping). The formatting step includes converting the dates into the correct format with lubridate and preparing the data for the quantitative text analysis with tidytext. Words describing the main topic of the headlines (“cop26”, “glasgow”, “climate”, “change”) were expected to be very frequent without contributing to a specific sentiment, so they have been removed as stopwords.
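A minimal sketch of the formatting step is given below. It is not the repository script: the raw date format, the example headlines and all object names are assumptions made for illustration; only the four custom stopwords come from the text above.

```r
library(dplyr)
library(tidytext)
library(lubridate)

# Hypothetical scraped rows; the "day month year" date format is an assumption.
headlines <- tibble(
  date_raw = c("31 October 2021", "1 November 2021"),
  headline = c("Cop26 opens in Glasgow as leaders warn of climate crisis",
               "World leaders pledge to cut emissions at Cop26")
)

# Topic words listed in the text, treated as custom stopwords.
custom_stopwords <- tibble(word = c("cop26", "glasgow", "climate", "change"))

tidy_headlines <- headlines %>%
  mutate(date = dmy(date_raw)) %>%           # parse text dates into Date objects
  select(-date_raw) %>%
  unnest_tokens(word, headline) %>%          # one lower-cased word per row
  anti_join(stop_words, by = "word") %>%     # general English stopwords from tidytext
  anti_join(custom_stopwords, by = "word")   # topic-specific stopwords
```

Because unnest_tokens() lower-cases by default, “Cop26” and “Glasgow” are matched by the lower-case stopword entries.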
By exploring the collected data, we aim to understand which words are the most frequent and whether they could play a role in our investigation.
We visualize the most frequent words with a frequency table and an exploratory WordCloud. We identify ‘crisis’ as the most frequent word (apart from the customized stopwords) used in the headlines during the COP26 period. Other very frequent words are listed in the following table.
| word | n |
|---|---|
| crisis | 58 |
| net | 52 |
| world | 50 |
| video | 48 |
| australia | 41 |
| johnson | 40 |
| happened | 38 |
| global | 34 |
| boris | 33 |
| emissions | 32 |
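A frequency table of this kind can be obtained with a single count() call. The sketch below uses a handful of hypothetical tokens rather than the real data, so the counts do not match the table above.

```r
library(dplyr)

# Hypothetical tokens standing in for the cleaned headline words.
tidy_headlines <- tibble(
  word = c("crisis", "net", "crisis", "world", "net", "crisis")
)

# Count each word and sort from most to least frequent.
word_freq <- tidy_headlines %>%
  count(word, sort = TRUE)

word_freq
```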
Using the keyword-in-context function from the quanteda package, we quickly checked whether the word ‘crisis’ ever plays a role other than being part of the ‘climate crisis’ bigram. This is not the case. Since the main topic of COP26 is precisely that of ‘tackling the climate crisis’, this word, despite clearly indicating a negative sentiment, does not carry relevant information. It is therefore dropped.
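Such a check could look like the following sketch, which assumes quanteda is installed and uses two hypothetical headlines. It extracts the word immediately preceding each occurrence of ‘crisis’ to see whether it is always ‘climate’.

```r
library(quanteda)

# Hypothetical headlines, not taken from the scraped data set.
headlines <- c(
  h1 = "Climate crisis dominates the opening day of talks",
  h2 = "Leaders urged to treat the climate crisis as an emergency"
)
toks <- tokens(headlines, remove_punct = TRUE)

# Keyword in context: two words on each side of every match of "crisis".
crisis_context <- kwic(toks, pattern = "crisis", window = 2)

# Keep only the last word of the left-hand context, i.e. the preceding word.
preceding <- sub(".* ", "", as.data.frame(crisis_context)$pre)
table(tolower(preceding))
```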
The sentiment analysis applied to the collected headlines is conducted using a dictionary-based method. The three dictionaries used are:
- ‘Bing et al.’
- ‘AFINN’
- ‘Lexicoder Sentiment Dictionary’ (LSD2015)
The choice of these dictionaries is mainly based on common practice and on the objective of our research: to check the sentiment of the headlines around the climate negotiations, quantify it, and detect potential patterns as well as the consistency of the results.
From the tidytext package, we use the ‘Bing et al.’ and the ‘AFINN’ dictionaries. These are general-purpose lexicons based on unigrams (single words). The first one classifies words as negative or positive, while the second one scales the sentiment by assigning each word a value between -5 (very negative) and +5 (very positive).
From the quanteda package, the Lexicoder Sentiment Dictionary represents a valid alternative, due to its particular suitability for sentiment analysis of political communication (Young, L. & Soroka, S., 2012). The dictionary consists of 2,858 ‘negative’ sentiment words and 1,709 ‘positive’ sentiment words. The novelty of Young and Soroka’s approach lies in a further set of 2,860 and 1,721 negations of negative and positive words, respectively. However, we did not find this additional set useful for our research purposes.
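The different shape of the two lexicons can be sketched as follows. The Bing lexicon ships with tidytext; AFINN is obtained through get_sentiments("afinn"), which requires a one-off download via the textdata package, so the sketch replaces it with a tiny hand-made stand-in whose scores are illustrative assumptions and not copied from AFINN. The tokens are hypothetical as well.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens standing in for the cleaned headline words.
tidy_headlines <- tibble(
  date = as.Date(c("2021-11-01", "2021-11-01", "2021-11-02")),
  word = c("protest", "success", "talks")
)

# Bing et al.: one categorical label ("negative" / "positive") per word.
bing <- get_sentiments("bing")
bing_tagged <- tidy_headlines %>%
  inner_join(bing, by = "word")

# AFINN-style lexicon: one integer score per word, between -5 and +5.
# Stand-in table with assumed scores; the real lexicon would be loaded with
# get_sentiments("afinn").
afinn_toy <- tibble(word = c("protest", "success"), value = c(-2, 2))
afinn_tagged <- tidy_headlines %>%
  inner_join(afinn_toy, by = "word")
```

Words that do not appear in a lexicon are dropped by inner_join(), so only classified words remain in each result.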
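A minimal sketch of applying LSD2015 without the negation sets is shown below. It assumes quanteda is installed and uses two hypothetical headlines; restricting the dictionary to its first two keys is one way of leaving the negated entries out and is not necessarily what the analysis script does.

```r
library(quanteda)

# Hypothetical headlines, not taken from the scraped data set.
headlines <- c(
  h1 = "Fears grow over weak pledges",
  h2 = "Hope for progress as talks begin"
)
toks <- tokens(headlines, remove_punct = TRUE)

# Keys 1 and 2 of the dictionary are "negative" and "positive";
# the negation keys (neg_positive, neg_negative) are left out.
lsd_polar <- data_dictionary_LSD2015[1:2]

# Replace each matching token with its dictionary key and count keys per headline.
toks_lsd <- tokens_lookup(toks, dictionary = lsd_polar)
dfm_lsd  <- dfm(toks_lsd)
dfm_lsd
```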
As explained above, ‘Bing et al.’ classifies words as positive or negative. Applying this dictionary to our dataset assigned sentiment values to 295 words, which are distributed over the examined time period. Two vertical dashed lines have been added to the plot to mark the period in which COP26 took place.
The figure clearly shows that the count of words with either a positive or a negative classification increased during the negotiation period (31 October to 12 November). Yet, from this graph it is quite difficult to deduce any trend in the general sentiment of The Guardian’s headlines.
To take a closer look at the frequencies of the classified words, the following faceted barplot shows the most frequently occurring words per sentiment. The frequency threshold has been set to strictly more than 4 occurrences per day, for each sentiment. Negative words appear to be more frequent than positive ones.
Interestingly, ‘protest’ is classified as a negative word. It can be argued that here the actual function of the word depends heavily on the context and on the ideological positioning with respect to the climate crisis, as many protests during COP26 aimed at demanding more effective climate action.
After this first look at the Bing et al. classification, it is worth seeing how the classified words are distributed over time. The following interactive plots illustrate the development of negative and positive words along the examined timeframe. Both plots show the count of each type of word by date. The first one includes the non-classified words, while the second one excludes them to give a closer look at the changes in the counts of negative and positive words.
In this plot, one can clearly see the surge of words in the headlines published while COP26 was taking place, and that most of the words in The Guardian’s headlines were not classified by the dictionary.
This surge in word counts and the prevalence of non-classified words, especially during the Conference, could have been expected: during this period a local news outlet would cover the negotiations extensively and report on them in a rather non-subjective way.
The second graph makes it clear that negativity prevails over positivity in the headlines, especially after COP26.
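The data behind two plots of this kind can be prepared as in the sketch below. The tokens and the two-word lexicon are hypothetical stand-ins for the cleaned headlines and the Bing et al. dictionary; the label “not classified” is also an assumption.

```r
library(dplyr)
library(tidyr)

# Hypothetical tokens and a toy lexicon standing in for Bing et al.
tidy_headlines <- tibble(
  date = as.Date(c("2021-11-01", "2021-11-01", "2021-11-01", "2021-11-02")),
  word = c("protest", "leaders", "success", "talks")
)
lexicon_toy <- tibble(
  word      = c("protest", "success"),
  sentiment = c("negative", "positive")
)

# First plot: keep every word and label unmatched ones as "not classified".
sentiment_by_date <- tidy_headlines %>%
  left_join(lexicon_toy, by = "word") %>%
  mutate(sentiment = replace_na(sentiment, "not classified")) %>%
  count(date, sentiment)

# Second plot: the same counts restricted to classified words.
classified_only <- sentiment_by_date %>%
  filter(sentiment != "not classified")
```

Using left_join() rather than inner_join() is what keeps the unmatched words in the first table.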
As observed in the barplot of the most frequently occurring words, negative unigrams such as ‘protest’, ‘limit’ and ‘poor’ are mentioned often. This leads us to ask the following questions:
- In the context of COP26, are protests negative?
- Don’t protest movements like Fridays for Future have a positive impact on the climate crisis?
- To what extent are these words really negative, and to what extent do they bias the results?
To better understand how negative or positive the words used in The Guardian’s headlines are, we make use of the AFINN dictionary.
This lexicon assigns values from -5 to +5 to the words it classifies, ranging from very negative to very positive.
The plotly package is used to visualize the values of negativity and positivity. This package offers the opportunity to compare data by hovering over the points (‘Compare data on hover’ button), making it possible to examine the word count by date for each sentiment value.
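The table feeding such a plot, with one row per date and sentiment value, can be sketched as follows. The tokens and the scores are hypothetical and are not copied from AFINN; the real lexicon would be joined in the same way.

```r
library(dplyr)

# Hypothetical tokens and an AFINN-style stand-in with assumed scores.
tidy_headlines <- tibble(
  date = as.Date(c("2021-11-01", "2021-11-01", "2021-11-02", "2021-11-02")),
  word = c("failure", "threat", "hope", "failure")
)
afinn_toy <- tibble(
  word  = c("failure", "threat", "hope"),
  value = c(-2, -2, 2)
)

# Number of scored words for each combination of date and sentiment value.
value_by_date <- tidy_headlines %>%
  inner_join(afinn_toy, by = "word") %>%
  count(date, value)

# Observed score range, e.g. to check whether the extremes -5 and +5 occur.
range(value_by_date$value)
```

In a bubble plot built from this table, the date would go on the horizontal axis, the value on the vertical axis, and n would determine the size of each point.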
The results are not as negative as the previous plots suggested. Examining the extent to which these headlines are classified as negative or positive provides a clearer and more consistent picture of the collected data. The negative trend seen in the results of the Bing et al. dictionary makes more sense once we see that the vast majority of the classified words fall in the -2 to +2 score range.
It is remarkable that not a single word was assigned a value of -5 or +5. The reason could be that such dictionaries have been validated on a combination of crowdsourced data, restaurant or movie reviews, and Twitter data. Words scoring -5 or +5 may therefore not fit the writing style of an established newspaper. In addition, headlines in particular do not often contain strong wording. It could be useful here to analyse words from the articles instead.
The AFINN dictionary was applied in a similar way to the headlines containing the ‘Boris Johnson’ bigram, which also turned out to be quite frequent in the outlet’s headlines (see WordCloud). The following plot shows the targeted sentiment analysis for this bigram.
This last interactive plot shows a fairly balanced situation. Headlines about the Prime Minister lean towards the negative side of the spectrum before, during and after COP26, yet positive and negative words both appear throughout the period.
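One simple way to build such a targeted subset is to keep only the headlines that mention the bigram before tokenizing. The sketch below follows that assumption; the headlines and the scores are hypothetical, and the analysis script may select the headlines differently (for instance with quanteda).

```r
library(dplyr)
library(tidytext)

# Hypothetical headlines and an AFINN-style stand-in with assumed scores.
headlines <- tibble(
  date     = as.Date(c("2021-11-01", "2021-11-02", "2021-11-03")),
  headline = c("Boris Johnson warns of failure at summit",
               "Activists march through the city",
               "Ministers hail progress on coal")
)
afinn_toy <- tibble(word = c("failure", "progress"), value = c(-2, 2))

# Keep headlines mentioning the bigram, then tokenize and score their words.
johnson_scored <- headlines %>%
  filter(grepl("boris johnson", headline, ignore.case = TRUE)) %>%
  unnest_tokens(word, headline) %>%
  inner_join(afinn_toy, by = "word")
```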
To validate the consistency of the findings obtained with the previous two dictionaries, a third dictionary, this time included in the quanteda package, is used to perform the same kind of analysis. To avoid repeating similar visualizations, with the LSD2015 dictionary we decided to show the distribution of sentiment across all the headlines, compared with the targeted analysis for Boris Johnson. The distribution of the sentiment scores gives an indication of how homogeneous the tone adopted by the newspaper is.
The results of the LSD2015 sentiment analysis are fed into a logarithmic function that yields a final score of positivity over negativity for the words used, which allows us to see the distribution of these scores.
The initial processing of the words showed that ‘Boris Johnson’ occupied a large share of the headlines of the British outlet, as he was the Prime Minister of the country hosting the COP26 negotiations. A dedicated analysis is therefore conducted to assess the sentiment of the headlines related to the UK Prime Minister according to the dictionaries used, so that the reader can compare the headlines related to Johnson with the overall sentiment of all the headlines.
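The exact function is not spelled out in the text. A common choice for dictionary counts, assumed here purely for illustration, is a smoothed log ratio of positive to negative matches:

$$\text{score} = \log\frac{\text{positive} + 0.5}{\text{negative} + 0.5}$$

The sketch below computes it with base R on hypothetical per-headline counts; the constant 0.5 is an assumed smoothing term that avoids taking the logarithm of zero.

```r
# Hypothetical counts of positive and negative dictionary matches per headline.
counts <- data.frame(
  doc_id   = c("h1", "h2", "h3"),
  positive = c(2, 0, 1),
  negative = c(0, 3, 1)
)

# Smoothed log ratio: above 0 is net positive, below 0 net negative,
# exactly 0 when both counts are equal.
counts$log_score <- log((counts$positive + 0.5) / (counts$negative + 0.5))

# Kernel density estimate of the scores, i.e. their distribution.
score_density <- density(counts$log_score)
```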
The research conducted above has shown that, regardless of the dictionary used to carry out the analysis, the general sentiment is the same: in all cases, the negatives outweigh the positives. Even before COP26 had begun, The Guardian reflected a rather pessimistic outlook on the climate conference. Despite a surge in positive sentiment right after the conference started, the sentiment frequency plot by date shows that this is still counteracted by an overwhelming number of headlines with negative sentiment. This is further supported by the results of the AFINN dictionary analysis, which indicate a similar pattern: the circles for negative sentiment appear earlier, and their size shows that these sentiments also occur more frequently. Whilst the sentiments are not distributed uniformly across time, it can be said that at every point in time The Guardian generally reflects a negative outlook on COP26.
As an additional part of our analysis, we considered the sentiment of the COP26 headlines that mention Boris Johnson and compared it with that of all headlines. This is captured by the plots generated using the AFINN dictionary as well as the LSD2015 dictionary. The distribution of sentiments indicates that these two categories behave similarly; however, the headlines mentioning Johnson are less polarized, as indicated by lower peaks and higher troughs. On top of that, the tails of the density plot indicate that the most negative words express more extreme sentiment than the most positive words. The AFINN dictionary leads to the same conclusion. In the “Boris plot” the majority of negative sentiments accumulate around -2, whilst the analysis of all headlines includes values as low as -3 and -4.
Even though The Guardian did not publish exclusively negative headlines regarding COP26, these consistently outnumber the positively toned headlines. We suspect this could also be partially due to the sensationalist nature of newspapers. A daunting, apocalyptic headline will quite naturally generate more clicks and engagement, which could push journalists towards dramatized headlines. Ultimately, in our discussion of these results, we realised that The Guardian reflects our own attitudes towards COP26 fairly accurately. On the one hand, it is encouraging to see countries joining forces and coming up with solutions for how best to tackle climate change. On the other hand, we feel an increasing frustration that, whilst we are taking steps in the right direction, we know that these steps are not enough.
As stated above, we see a potential drawback of our research in the fact that it has been limited exclusively to headlines. Not only does this constrain the context in which negative or positive words appear, it could also reflect an exaggerated sentiment. To gain a more nuanced perspective on what the sentiments truly are, the scraping could be expanded to include the articles themselves, which would immediately and substantially enlarge our data set. Finally, it would have been interesting to consider the sentiment surrounding COP26 in other newspapers, either across the United Kingdom or, more broadly, across the world.